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Microcanonical thermostatistics analysis has become an important tool to reveal essential aspects 
of phase transitions in complex systems. An efficient way to estimate the microcanonical inverse 
temperature /3(E) and the microcanonical entropy S(E) is achieved with the statistical temperature 
weighted histogram analysis method (ST-WHAM). The strength of this method lies on its flexibility, 
as it can be used to analyse data produced by algorithms with generalised sampling weights. 

However, for any sampling weight, ST-WHAM requires the calculation of derivatives of energy 
histograms H(E ), which leads to non-trivial and tedious binning tasks for models with continuous 
energy spectrum such as those for biomolecular and colloidal systems. Here, we discuss two 
alternative methods that avoid the need for such energy binning to obtain continuous estimates for 
H(E) in order to evaluate /3(E) by using ST-WHAM: (i) a series expansion to estimate probability 
densities from the empirical cumulative distribution function (CDF), and (ii) a Bayesian approach 
to model this CDF. Comparison with a simple linear regression method is also carried out. The 
performance of these approaches is evaluated considering coarse-grained protein models for folding 
and peptide aggregation. 
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I. INTRODUCTION 

Fundamental aspects of phase transitions in complex 
systems can be revealed by the analysis of its microcanon¬ 
ical thermostatistics lb 2 , which is characterised by the 
well known entropy S(E) = /^lnf^E), where Ul(E) 
denotes the density of states of a system with energy 
E. In particular, the analysis of inflection points of the 
microcanonical inverse temperature /3(E) = dS(E)/dE 
plays an important role in the identification of stable, 
unstable and metastable regions in the phase diagram El 


evaluate the values of barrier heights and latent heats. 
In this way, the microcanonical thermostatistics analysis 
has been incorporated in many studies in the literature, 
e.g. Refs. f6M2] to name a few. 

Nevertheless, any analysis must rely on data obtained 
from efficient exploration of the configurational space. 
It is well known that numerical simulations performed 
with Monte Carlo (MC) and molecular dynamics (MD) 
methods pose limitations to the achievement of reliable 
data sampling m- Such limitations are related to the 
critical slowing down 0 , which is observed in studies of 
continuous phase transitions, and to the entrapment in 
local minima, in the case of systems with rugged energy 
landscapes. In both cases, the configurational space is 
poorly explored in a reasonable computational simulation 
time, which may produce biased physical averages. To 
overcome the trapping problem, it has been suggested 


|5], providing alternative insights to the usual canonical 
analysis. Also, free-energy profiles can be obtained from 
the caloric curves (3 vs E, from where one can easily 


that configurations must be sampled using algorithms 
based on generalised ensembles, where the updates are 
performed with non-Boltzmann statistical weights u(E). 
For instance, the multicanonical algorithm (MUCA) 
[HJ[T6], the extended Gaussian ensemble (EGE) [T714T9] . 
Tsallis statistical weight [201 |2T] . and replica exchange 
method (REM) [22] either use a series of Boltzmann 
weights or any convenient generalised sampling weight 

m■ 

MUCA simulations sample configurations with a 
weight (^^(E) in such a way that the energy distribution 
is uniform, H rnu (E) oc Q(E)u mu (E) = constant. Thus, 
a precise determination of u) mu (E) is equivalent to obtain 
an estimate for the density of states Q(E), i.e. uj mu (E) oc 
l/Ut(E). The weights uj rnu (E ) = exp [—b(E)E + cl(E)\ 
follows from the parameterization of the entropy S(E) = 
b(E)E — a(E), where a(E) and b(E) are the so-called 
multicanonical parameters. The iterative procedure to 
obtain the MUCA parameters is described in detail in 
references 00 , and can be read as, 

a n {E m -i) = a n (E m ) + [b n (E m _ t ) - b n (E m )]E m , (1) 

b n (E m ) = b n ~\E m ) + [In H^~\E m+1 ) - In H^{E m )]/e, 

( 2 ) 

where n stands for the nth multicanonical simulation. 
The recursion steps require the discretisation of the 
energy for continuous energy models. Therefore, it is 
convenient to define £ , m = Eo + me, where 5 is the 
binsize, m is an integer, and Eq is a constant that defines 
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a reference energy. All the energies E in the interval 
[E m ,E m + 1 [ contribute to the histogram H rnu (E rn ). 

Methods to improve sampling based on simulations 
at different temperatures have been proposed to either 
be conducted in parallel (REM) or as a random walk 
between different temperatures. In REM, 7V rep non¬ 
interacting replicas of the system are simultaneously 
simulated by the usual MC or MD algorithms, and 
from time to time, pairs of replicas at neighboring 
temperatures are exchanged with a transition probability. 
From the data produced by simulations performed at 
a single temperature T\ or at a set of temperatures 
T a , with a = 1,2, ••• , 7V rep , it is necessary to employ 
a reweighing scheme to evaluate physical averages at 
a given temperature T. Reweighting techniques El b 
126] use data from either a single histogram or multiple 
histograms obtained from MC or MD simulations. 

Recently, a simple method called statistical weighted 
histogram analysis method (ST-WHAM) [27. has been 
proposed as an iteration-free procedure to obtain an 
estimate for the microcanonical inverse temperature. 
In this method the usual WHAM equations [24] [25] 
are converted into a weighted average of the individual 
densities of states obtained from simulations carried out 
with different sampling weights uj(E). From energy his¬ 
tograms produced by multiple simulations, ST-WHAM 
yields a statistical temperature T(E) = 1//3(E), which 
is an estimate of the inverse microcanonical temperature 
/3(E) = dS(E)/dE. Interestingly, there is a numerical 
procedure based on the multicanonical recursion relations 
([l]) and ([2|, which is called ST-WHAM-MUCA [28], that 
can be used to replace the direct integration in order to 
evaluate the entropy S(E). Although both ST-WHAM 
and ST-WHAM-MUCA have the advantage of a posteri¬ 
ori discretisation of energies, their naive implementations 
may lead to biased evaluations of physical quantities for 
continuous energy models just like all the aforementioned 
algorithms. 

As described in Section II, the estimates 13(E) for 
inverse microcanonical temperature /3(E) depends on the 
derivatives of the energy histograms H(E) (see Eq. ©)• 
Here, we analyse how the estimates /3(E) are energy 
binning dependent and, in Section III, we present two 
alternative approaches that avoids the need for energy 
binning to evaluate the microcanonical caloric curve 
for continuous energy models: (i) a proposal by Berg 
and Harris [29], which involves an empirical cumulative 
distribution (CDF) and uses discrete Fourier series; and 
(ii) a Bayesian approach [3Q to model this CDF and 
assumes that the thermodynamic phase transitions are 
well described by the coexistence of two phases. A com¬ 
parative analysis between these approaches is made in 
order to characterise /3(E) for two systems that undergo 
first-order-like phase transitions: the folding transition 
of a coarse-grained protein model for Ubiquitin and 
the aggregation transition of two heteropolymers that 
follows a Fibonacci sequence. These examples allow us to 
describe the statistical and systematic errors involved in 


the numerical calculations of H(E) and /3(E), which are 
presented in Section IV. Conclusions on this comparative 
analysis are presented in Section V. 


II. STATISTICAL TEMPERATURE WEIGHTED 
HISTOGRAM METHOD 


The ST-WHAM [27] yields a direct estimate of the in¬ 
verse microcanonical temperature /3(E) = d\nQ(E)/dE 
by considering the statistical inverse temperature 

+ ( 3 ) 


where /* = H,J J2-, PS = d\nH a /dE, and p“ = 
—d\nuj a /dE. It is preferable to write Eq. © as 


m 


1 u ( t?\ (d\nH a (E) d lmj a (E) ^ 

“ V iE - d^)' 


( 4 ) 

Note that /?£ = 1/T a for simulations with the canonical 
weight. With the set of estimates j3(E m ), MUCA 
recurrence relations ([l]) and (J2| can be applied to obtain 
estimates S(E m ) for the microcanonical entropy S(E rn ), 


S(E m ) = P(E m )E m - a(E m ) . (5) 


This ST-WHAM-MUCA algorithm is quite simple if one 
has (3(E m )- 


III. NUMERICAL EVALUATION OF 
DERIVATIVES 

A. Linear regression 

We can numerically evaluate the derivatives in Eq. © 
in a naive way, where the derivatives d\nH(E)/dE at 
energies E m follow from a linear regression around this 
point. For instance, we use a linear regression with k = 
15 points; selecting k points means that the derivative 
at Em is calculated with the values of H(E$), where 
£ = m — (k — l)/2, m + 1 — (k — l)/2, • • • , m, • • • , m + 
(k — l)/2. We chose a value for k according to the energy 
binsize e. Consequently, we calculate the derivatives in 
the energy range A E = (k — l)e. In this method, it 
is more convenient to directly calculate the derivative of 
In H(Em) than the derivative of H(E m ) • We calculate the 
linear regression with a subroutine easily adapted from 
the linear fit subroutine in m- 


B. Cumulative distribution method 

Another approach can be devised by considering an 
algorithm based on the cumulative distribution function 
(CDF) [29] . The advantage of such approach is that 
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it avoids histogramming when describing probability 
densities P(E ), dismissing the need for any ad hoc energy 
discretisation. The method defines an estimator F(E) 
for the CDF F(E ), where the function F(E) is an 
empirical cumulative distribution function (ECDF) for 
the probability density P(E). The algorithm sorts the 
energy time series of length NDAT in an ascending order 
(Ei < E 2 < • • • < -Fndat ) 5 so any outliers can be 
eliminated by constructing a restricted ECDF F a b(E) in 
the range between two meaningful points a and b (in 
general one takes a — E\ and b = Endat)- The basic 
idea is to propose an approximating function Fq(E) to 
describe F a b(E ), from where the difference function is 
defined, 


R(E) = F ab (E) - F 0 (E). (6) 

This function can be expanded in Fourier series, 

MMAX / , „ 

«(£)=£<(7) 

m =1 ' ' 

which gives the Fourier coefficients m, 

iim) =CC [ r{e) ™ (HH) dE ■ < 8 » 

Here we use the same criteria of [29| where the maximum 
number MMAX of coefficients d(m ) is obtained by imposing 
the two-sided Kolmogorov test m- 

Notably, Eq. <© provides all the information that 
one needs to obtain a parameter-free energy probability 
density P(E). The reason is because Eq. © yields a 
smooth estimate of the CDF given MMAX coefficients in 
the Fourier expansion, 

F ab (E) = F 0 (E) + R(E) , (9) 


C. Bayesian analysis 


Next we present our approach to implement a Bayesian 
analysis. To this end, input data (denoted by y) is the 
empirical cumulative distribution and not the histograms 
that are dependent on the energy binsize 5. Thus, the aim 
is to describe the empirical cumulative distribution by a 
given model (called model M) and using the Bayesian 
analysis m , extract values for its parameters (denoted 
by A) from the posterior probability density function 
(PDF) P(A, v\ D , M). Here, V denotes possible parame¬ 
ters related to the experimental conditions, which is the 
temperature in our numerical experiment, and D refers 
to a set of observed experimental points {Ei}. From 
the posterior PDF, we extract the desired probability 
distributions for each parameter, 

P(Xi\D,M) = j P{\,v\D,M)d\ (jlti) dv, (11) 


which correspond to the marginalized distributions. 
These marginalized distributions are depicted with his¬ 
tograms in Fig. [6] for the Ubiquitin. 

For each protein, our experimental data y is produced 
by dividing the observed energy range into 35 points {Ei} 
and experimental errors are extracted from the jackknife 
procedure. We anticipate here that these experimental 
data are displayed in Figures [4] and [10] for the folding¬ 
unfolding transition of Ubiquitin and for the aggregation 
of two Fibonacci sequences, respectively. Within a 
Bayesian approach, the set of observations {yi} is just 
a set among possible experimental outcomes y at the 
points {Ei}. Thus, given a fixed model M and a set 
of experimental values y({Ei }), parameters estimation 
within Bayesian approach is obtained as follows m, 


even if one assumes a linear ansatz for Fq(E) (see 
[215]). In turn, it becomes easy to obtain estimates for 
the probability density P a b(E ) by differentiation, which 
consists in a smooth estimation of H(E). 

Now, let us go back to the numerical differentia¬ 
tion in Eq. ©• To obtain parameter-free derivatives 
dH(E)/dE , we just need to differentiate F ab (E) again. 
This corresponds to the following numerical calculation, 


d 2 F ab (E) 
dE 2 


MMAX 


= - F ah 


m= 1 


mir 


sm 


m7r(E — 



( 10 ) 

Unlike the linear regression method, where we calculate 
the derivative of \nH(E ), here is more convenient to 
compute the derivative of H(E ) directly. We have 
included in Appendix the FUNCTION DERPD(X), which 
can be used to estimate this derivative from the Fourier 
coefficients d(m). One can easily incorporate this func¬ 
tion in the program CDF_PD subroutine provided by Berg 
and Harris m- 


P(X , * | D, M) = = ) 

f P({Eij = D | A, v, M) P 0 (X, v | M) dX dv 

( 12 ) 

where Pq(\,v\M) is the prior distribution for a fixed 
model M. 

Since our aim is to characterise P(E) in a first-order 
phase transition region, we assume that the energy 
distribution P(E) ( i.e . a continuous estimate for H(E ) 
in Eq. Q) is well described by a double-peak function. 
This function characterises the coexistent phases at tem¬ 
peratures close enough to the transition temperature T c , 
whose shape is a consequence of the free-energy barrier 
between the ordered and disordered phases. Therefore, 
the expected energy distribution P(E) can be written 
as a normalized sum of two Gaussian distributions with 
peaks centered at energies yi and /i2, corresponding to 
the disordered and ordered phases, respectively [32]. If 
we assume the model M as the cumulative distribution 
function of P(E ), and recalling that for a Gaussian 
distribution the CDF is the error function, we arrive to 
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the following modelling, 


F(E) = 0.5 


1 


^erf 


E - ii i 

V2 


si 


+ (1 — a) erf 


E - /i 2 


S2V2 

(13) 

This is a 5-parameter model, which we call model I in 
our analysis, with A = (//i, si, /i 2 , <§ 2 , a). 

Now, to obtain the posterior probability of the parame¬ 
ter set A, given data set T>, from the likelihood Pi+i{D | A) 
times the initial prior probabilities Pi (A) for our model 
I, we use a Markov chain Monte Carlo. To this end, 
we consider the Metropolis algorithm with the proposal 
function 


P0 | A) = TT —= e -[w*-^(^)] a /2a? ; ( 14 ) 

CTiV 27T 

where <Ji are the errors in our computational experiment. 
The values F(Ej) are calculated with the accepted values 
for the parameters A of model I. 



FIG. 1. Statistical temperature estimates /3(E) of Ubiq- 
uitin using data obtained from MD simulations at T — 
1.20. Curves show the dependence of /3(E) on energy 
discretisations e (s = 0.3,0.8 and 1.3) for the linear 
regression method with k = 15 points. 


IV. NUMERICAL COMPARISONS 

In the following, we present results where we evaluate 
the performance of the different approaches in obtaining 
continuous estimates for H(E) and, consequently, the 
estimate /3(E) for microcanonical inverse temperature. 
The comparative analysis is made with data obtained 
from both MC and MD simulations, where we employ 
two simplified off-lattice protein models, i.e. all atoms in 
the polypeptide chain are replaced by beads located at 
the C^-atom position, and present a continuous energy 
spectrum. 

A. Ubiquitin 

Ubiquitin is a 76 residue small protein (PDB code 
1 UBQ) for which evidences suggest a two-state folding 
mechanism. It had received considerable attention for 
what concerns solvent response and temperature depen¬ 
dence in the secondary structure formation _ 33H35j and 
stretching experiments [36H38] . 

To study the performance of the numerical approaches 
in evaluating f3(E) through the folding-unfolding tran¬ 
sition, we perform simulations with a coarse-grained 
version for Ubiquitin but with a rather detailed poten¬ 
tial. We use the structure-based model implemented in 
SMOG@ctbp web server 39 to perform MD simulations 
with GROMACS package. The analyses were made from 
10 6 measurements during 250 ns of simulation after a 
period of thermalization. Measurements were obtained 
from 7 independent simulations at temperatures T a = 
Ti + (a - 1) AT with T x = 1.14 and AT = 0.02. 

First, we evaluate the performance of the approaches 
considering the data obtained from a single MD tra¬ 
jectory produced at T 4 = 1 . 20 , which is close to the 



FIG. 2. Comparison between estimates j3(E) of Ubiquitin 
with the linear regression (e = 0.8) and CDF method for 
data obtained at T = 1.20. Statistical errors for the CDF 
method were calculated with 20 jackknife bins. 


transition temperature T c = 1.219(3) evaluated with 
statistics obtained with 7 independent MD simulations. 
Figure [l] displays the estimates for the caloric curve using 
the linear regression to evaluate derivatives of In H(E rn ) 
from data obtained at T = 1.20. This figure shows how 
noisy the caloric curve can be as a function of the energy 
binsize 5 = 0.3, 0.8 and 1.3, keeping the number of points 
k = 15 fixed. Figure [2] compares what seems a good 
energy discretisation with the CDF method. Error bars 
based on 20 jackknife bins are presented only for the 
CDF method. Now, Fig. [^compares both methods when 
we consider the statistics obtained with the entire set of 
temperatures Tfi, (a = 1, 2, • • • ,7). To present error bars 
for the CDF method, we have increased the number of 
jackknife bins to 80 because this larger statistics. Nice 
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FIG. 3. Estimates /3(E) for Ubiquitin obtained with lin¬ 
ear regression (e = 0.8) and CDF method with data 

generated from independent MD simulations at T = 
{1.14,1.16,1.18,1.20,1.22,1.24,1.26}. Statistical errors for 
the CDF method were calculated with 80 jackknife bins. 


agreement between both methods are obtained if we 
choose £ conveniently. 

Figure [4] displays what we call input experimental 
points to perform the Bayesian analysis. This experi¬ 
mental data was obtained from MD simulations at T = 
1.20. The dashed line corresponds to the cumulative 
distribution obtained with model I. The parameters of 
this model that fit the input data are displayed in 
Table 1. We include values calculated from means of 
the marginalized distributions, and global modes that 
presented the smallest y 2 /d.o.f of the model. To verify 
how the proposed model describes the energy distribution 
when compared with the CDF method and naive his- 
togramming, we show in Fig. [5] the results obtained from 
these methods for the statistics collected at T = 1.20. We 
realize that model I does not recover properly the region 
between the peaks of the energy distributions H(E ) 
obtained with previous numerical methods. The inset 
in this figure compares the calculations of dH(E)/dE 
following from the CDF and Bayesian approaches. In 
Fig. [i] we display the marginalized distributions of A, 
including the correlation between the parameters s i and 
a to illustrate their interdependence. We have always 
used flat priors over appropriated ranges to obtain the 
parameter distributions. 


B. Fibonacci sequences 

Figure [7] shows the caloric curve for the aggregation 
of two chains of monomers described by the AB model, 
a coarse-grained off-lattice protein model that replaces 
the all-atom potentials by simpler physical interactions. 
This model reduces the protein to a chain of monomers 
of two types: the hydrophobic (A) and the polar or 


. . 

o empirical data T= 1.20 ^ 



y 


¥ 

0 — i i i i 1 i i i i 1 i i i ■ 1 . t . i I i i i ■_ 

50 100 150 200 250 300 

E 

FIG. 4. Cumulative distribution data (o) of Ubiquitin 
with error bars calculated with 20 jackknife bins from 
MD simulation at T m 1.20. Dotted line corresponds 
to the Bayesian estimates of the cumulative distribution 
assuming model I. 



FIG. 5. Energy histograms H(E ) for Ubiquitin. Er¬ 
ror bars are calculated with 20 jackknife bins and are 
shown only for the CDF method. Inset: comparative 
behaviour of dH(E)/dE produced with the CDF method 
and Bayesian approach assuming model I. 


hydrophilic ( B ) types, located at the C a atoms. The 
AB model has also been used in studies to explore aggre¬ 
gation phenomenon mm, where more than one chain is 
included in the system, and in studies of microcanonical 
thermostatistics of heteropolymers [6]. 

Here, we consider a simple system which consists of 
two heteropolymers defined by Fibonacci sequences with 
13 monomers, i.e. ABBABBABABBAB. The statistics for 
this system amounts to 10 7 sweeps per replica produced 
with REM. Attempts to exchange the replicas occur 
every n s = 10 4 sweeps, using a scheme that alternates 
attempts between even replicas and their neighbors, and 
odd replicas and their neighbors. Although one has 
several manners to define the temperature set [20], here 
we determine it using an arithmetic progression f3 a = 
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FIG. 6. Marginalized PDFs of the parameters for model I, 
and scatter plot for joint probability of a and s 1 with data 
obtained for Ubiquitin at T = 1.20. 


Pi + (a — 1) A/3. We consider 3V rep = 7 replicas choosing 
Pi = 4.2, and A p = 0.4. 

Figure [ 7 ] depicts the caloric curves P(E) obtained with 
energy binsizes 5 = 0.02,0.05, and 0.10. The value 
k = 15 was used to calculate the derivatives of In H(E m ). 
Different values of 5 allows one to verify the systematic 
errors associated with 6 and the use of linear regression. 
For a fixed &, small values of 5 (or A E) tend to reproduce 
the statistical fluctuations of the data. On the other 
hand, a value as large as 5 = 0.10 eliminates the physical 
information associated with the small S-loop at E ~ 0.2. 
For comparison, we include in Fig. [8] the P(E) estimates 
from the CDF method. These comparisons are based 
on a single time series selected from data produced with 
REM at the inverse temperature P = 5.0. It is important 
to reliably calculate the caloric curve when assessing 
the canonical critical temperature and the latent heat 
of the transition. We can figure out how important 
the fluctuations observed in these curves P(E) are by 
calculating the statistical errors associated with the CDF 
method. Figure [8] displays the statistical error bars 
calculated with 20 jackknife bins for the dataset with 
10 7 measurements obtained with REM at P = 5.0. Both 
methods present comparable statistical error bars. This 
figure shows that the small S-loop at E ~ 0.2 does not 
result from the statistical fluctuations in our dataset. 
Therefore, any smooth estimate of this curve at E ~ 0.2, 
obtained for example with A E = 1.40 (e = 0.10), would 
hide physical information. 

In Fig. [9] we include data produced with 3V rep = 7 
replicas together. Comparison of the results presented in 
this figure with those obtained by multicanonical simu¬ 
lations in Refs. in m reveals that they are in excellent 
agreement: even the small S-loop near E ~ 0.2 appears in 
our case. Computation of the inverse of aggregation tem¬ 
perature /3 agg = 5.212(59) also resulted in a quantitative 
agreement. Larger error bars appear around E = —4, 



FIG. 7. Estimates /3(F) for the aggregation of two 
Fibonacci sequences for different parameters e and k = 15 
points with data selected for p — 5.0 among time series 
produced with REM. 



FIG. 8. Comparison between estimates P(E) for the 
aggregation of two Fibonacci sequences with the linear 
regression (e ss 0.05) and CDF method for data obtained 
at P — 5.0. Statistical errors for the CDF method were 
calculated with 20 jackknife bins. 


probably because the lack of temperatures in the set 
used to exchange the replicas. The averaged estimates 
from both derivative calculation methods agreed for a 
particular choice of 5 = 0.05 and k = 15 points in the 
linear regression method. 

Now, Fig. [lO] shows the experimental data we have 
used to perform the Bayesian analysis. Again, we 
initially keep the analysis restricted to a single energy 
time series, in this case produced at P = 5.0. We 
have included the Bayesian estimates (dashed line) of 
the cumulative function assuming the model I. The 
parameters for this model are presented in Table I. To 
verify how well this model reproduces P(E ), we include 
in Fig. [TT] the energy distributions obtained with a naive 
histogramming procedure, CDF method and Bayesian 
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FIG. 9. Comparison between estimates /3(E) for the aggre¬ 
gation of two Fibonacci sequences with the linear regression 
(e = 0.05) and CDF method for energy time series obtained 
from REM at p = {4.2,4.6, 5.0, 5.4, 5.8,6.2,6.6}. 



FIG. 10. Cumulative distribution data (o) for the 
aggregation of two Fibonacci sequences with error bars 
calculated with 20 jackknife bins from REM simulation 
for the energy time series obtained at /3 = 5.0. Dotted line 
corresponds to the Bayesian estimates of the cumulative 
distribution assuming model I. 


analysis. Clearly, Bayesian analysis does not reproduce 
the expected behaviour between both peaks. This leads 
to a biased estimation of dH(E)/dE as can be seem in 
the inset of this figure. Moreover, the approach with 
model I does not reproduce the small S-loop observed at 
E ~ 0.2 (data not shown), in contrast to the estimates 
with a convenient choice of 5 or the CDF method. As 
a consequence, we again do not follow further analysis 
with the Bayesian approach including all statistics from 
the 7 replicas. 


V. SUMMARY AND CONCLUSIONS 



We presented two alternative approaches for the esti¬ 
mation of the energy probability distributions avoiding 
the need of energy binning in order to obtain the mi- 
crocanonical thermostatistics analysis from ST-WHAM. 
Numerical comparisons between the approaches were 
presented and we showed the statistical and systematic 
errors that can arise when one evaluates the micro- 
canonical inverse temperature for two continuous energy 
models that exhibits first-order like phase transitions. 
Our results indicate that the CDF method yields reliable 
estimates for both H(E) and /3(E) when compared with 
the linear regression method, and far more successfully 
than the Bayesian approach with model I. Unlike the 
linear regression method, the CDF method avoids the 
undesirable task of choosing the energy binsize 6 for a 
careful evaluation of the microcanonical temperature, as 
highlighted in the analysis of the aggregation transition 
of the two Fibonacci sequences. In this case, the 
caloric curve presents a quite unusual behaviour with 
two S-loops, indicating two transitions. In particular, 
we showed that the use of large values for e in linear 
regression method would hide physical information about 


FIG. 11. Comparison of energy histograms H(E) for the 
aggregation of two Fibonacci sequences. Error bars are 
calculated with 20 jackknife bins are shown for the CDF 
method. Inset: comparative behaviour of dH(E)/dE as 
predicted with the CDF method and Bayesian analysis 
assuming model I. 


the small S-loop at E ~ 0.2. On the other hand, the small 
S-loop could not be detected with the Bayesian analysis 
(data not shown) as a consequence of the poor evaluation 
of H(E) and its derivative (Fig. 11). The reason is 
because the model I consists of only 5 parameters. This 
is a low number compared to the CDF method, which 
allows a variable number of parameters (i.e. Fourier 
coefficients) defined depending on the demand of the 
ECDF. For instance, the CDF method needs 13 Fourier 
coefficients to obtain probability densities for the Ubiq- 
uitin data and this number goes up to 74 for Fibonacci 
sequences. Furthermore, model I was constructed on the 
hypothesis that the coexistence of two thermodynamic 

















bulk phases can be well described by two gaussian 
distributions. Actually, this is an approximation because 
the energies in between the two peaks, describing mixed 
phase configurations, are not properly take into account 
in the two gaussian model. 
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VI. APPENDIX 

Derivative of the probability density: 

FUNCTION DERPD(X) 

include /Libs/Fortran/implicit.sta ; 

include , ../../Libs/Fortran/constants.par * 
PARAMETER(NMAX=100) 

COMMON /CDFProb/ XMIN,XRANGE,DN(NMAX),M 
!M number of Fourier coefficients 
DERPD=ZER0 
DO J=1,M 

AUX=J*PI/XRANGE 

AUX=AUX*AUX 

DERPD=DERPD-DN(J)*AUX*SIN(J*PI/XRANGE*(X-XMIN)) 
ENDDO 
RETURN 
END 
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Examples 


Mi 

Si 

M2 

S 2 

a 

X 2 /d.o.f. 

Ubiquitin 

mean 

250.36(37) 

22.73(22) 

102.64(27) 

22.31(15) 

0.4474(31) 

0.131 


global mode 

250.51 

22.64 

102.55 

22.27 

0.4474 

0.130 

Fibonacci 

mean 

0.797(15) 

0.7786(79) 

-8.128(37) 

1.968(25) 

0.4838(35) 

0.350 

sequences 

global mode 

0.803 

0.7754 

-8.147 

1.958 

0.4841 

0.349 


TABLE I. Parameters for the proposed model in Eq. (13) from averages and the x /d.o.f of the model. 








